title: “Parallel dynamics of gene content evolution across independent endosymbiont lineages.”

authors:

date: “09 April 2024”



BACKGROUND:

Abstract: Endosymbiosis has been involved in some of the major transitions in the history of life, including the origin of the eukaryotic cell. Besides the many evolutionary implications for the two partners, this process has consistently led to the emergence of highly reduced genomes in prokaryotes. So far, the processes that shape gene content throughout endosymbiosis have not been investigated in a broad phylogenetic framework encompassing multiple independent shifts in lifestyle. Using the Enterobacterales order of Gammaproteobacteria as a representative model, a novel phylogenomic framework was inferred across more than two hundred genomes, encompassing inferences on single copy genes concatenation and gene families. A massive spike in gene loss is consistently associated with the shift to endosymbiosis, followed by a marginal increase in gene loss rates compared to free-living species. Our results also highlight a role for gene acquisition reduction in the process of genome erosion occurring after the shift to endosymbiosis. Furthermore, a significant correlation was retrieved in the frequencies of gene family loss across independent endosymbiotic clades: genes with more conserved functions and stronger sequence evolution constraints are lost less frequently, and thus the correlation appears to be driven by the different levels of gene essentiality. This study highlights how constraints shape endosymbionts’ parallel dynamics of gene content evolution.



1 - PHYLOGENETIC INFERENCES:



code for concatenation-based species tree inferences:
iqtree2 -s min95%sp_p.ol.all_parts.aln -m TESTMERGE --rcluster 10 -spp min95%sp_p.ol.all_parts.prt -mset JTT,WAG,LG -bb 1000 -mrate G,R,E -T 32 --prefix all_parts_site_homogeneous

iqtree2 -s min95%sp_p.ol.good_part.aln -m TESTMERGE --rcluster 10 -spp min95%sp_p.ol.good_part.prt -mset JTT,WAG,LG -bb 1000 -mrate G,R,E -T 32 --prefix good_part_site_homogeneous 

iqtree2 -s min95%sp_p.ol.all_parts.aln -m TESTMERGE -mset JTT,WAG,LG,LG4M,LG4X,CF4,C10,C20,C30,C40,C50,C60 -madd JTT+C10,JTT+C20,JTT+C30,JTT+C40,JTT+C50,JTT+C60,WAG+C10,WAG+C20,WAG+C30,WAG+C40,WAG+C50,WAG+C60,LG+C10,LG+C20,LG+C30,LG+C40,LG+C50,LG+C60 -bb 1000 -T 32 --prefix all_parts_site_heterogeneous

iqtree2 -s min95%sp_p.ol.good_part.aln -m TESTMERGE -mset JTT,WAG,LG,LG4M,LG4X,CF4,C10,C20,C30,C40,C50,C60 -madd JTT+C10,JTT+C20,JTT+C30,JTT+C40,JTT+C50,JTT+C60,WAG+C10,WAG+C20,WAG+C30,WAG+C40,WAG+C50,WAG+C60,LG+C10,LG+C20,LG+C30,LG+C40,LG+C50,LG+C60 -bb 1000 -T 32 --prefix good_part_site_heterogeneous

iqtree2 -s min95%sp_p.ol.RS4.all_parts.aln -m TESTMERGE -rcluster 10 -spp min95%sp_p.ol.RS4.all_parts.prt -mset JC,F81,GTR -bb 1000 -mrate G,R,E -T 16 --prefix SR4_all_parts_site_homogeneous

iqtree2 -s min95%sp_p.ol.RS4.good_part.aln -m TESTMERGE -rcluster 10 -spp min95%sp_p.ol.RS4.good_part.prt -mset JC,F81,GTR -bb 1000 -mrate G,R,E -T 16 --prefix SR4_good_part_site_homogeneous

iqtree2 -s min95%sp_p.ol.RS4.all_parts.aln -m TEST -mset JC,F81,GTR -mdef SR4_profileModels_IQtree.nxs -madd SR4C10JC,SR4C20JC,SR4C30JC,SR4C40JC,SR4C50JC,SR4C60JC,SR4C10F81,SR4C20F81,SR4C30F81,SR4C40F81,SR4C50F81,SR4C60F81,SR4C10GTR,SR4C20GTR,SR4C30GTR,SR4C40GTR,SR4C50GTR,SR4C60GTR -bb 1000 -mrate G,R,E -T 32 --prefix SR4_all_parts_site_heterogeneous

iqtree2 -s min95%sp_p.ol.RS4.good_part.aln -m TEST -mset JC,F81,GTR -mdef SR4_profileModels_IQtree.nxs -madd SR4C10JC,SR4C20JC,SR4C30JC,SR4C40JC,SR4C50JC,SR4C60JC,SR4C10F81,SR4C20F81,SR4C30F81,SR4C40F81,SR4C50F81,SR4C60F81,SR4C10GTR,SR4C20GTR,SR4C30GTR,SR4C40GTR,SR4C50GTR,SR4C60GTR -bb 1000 -mrate G,R,E -T 32 --prefix SR4_good_part_site_heterogeneous



code for gene families tree inferences:

We build up gene families trees for each OG using two different dataset and the ParGene pipeline: (i) aa sequences for all gene families, (ii) SR4-recoded aa sequences for all gene families. the “\(Aln_dir" point to the directory were are stored original alignments and the recoded ones. "\)out” is the out directory.

pargenes.py -a "$Aln_dir" -o "$out" -c 16 -d aa -m --modeltest-criteria BIC --modeltest-perjob-cores 4



code for gene families-based species tree inferences:

We performed three gene families based tree inferences (i) all gene families (aa alignments and trees); (ii) gene families that pass the iqtree stationary/homogeneity test (aa alignments and trees) and (iii) all gene families but SR4-recoded (SR4-recoded alignments and trees). The GeneRax comand was the same for all analyses changing only the family file (“$Family_file” variable).

generax --families "$Family_file" --si-strategy HYBRID --species-tree MiniNJ --rec-model UndatedDTL --per-family-rates --prune-species-tree --si-estimate-bl --si-quartet-support --prefix "$out"



code for gene trees corrections and reconciliations:

The species trees resulting from SpeciesRax analyses were re-rooted to correct the root position and all gene trees were re-corrected for 3 rounds and reconciled infering per-species DTL rates

generax -f "$Family_file" -s "$tree" --per-species-rates --strategy SPR -p "$out" --max-spr-radius 3



R library requirements:
library(phytools)
library(phangorn)
library(ggtree)
library(ggplot2)
library(reshape2)
library(TreeDist)
library(ComplexHeatmap)
library(geiger)
library(ape)
library(gridExtra)
library(phylotools)
library(ggfortify)
library(ggExtra)
library(treeio)
library(tidyr)
library(tidytree)
library(plotly)
library(ggsignif)
library(ggpubr)
library(stringr)
library(forcats)
library(dplyr)
library(cowplot)
library(ggrepel)
library(extrafont)
library(corrplot)
library(rstatix)
library(ggcorrplot)
library(ggwordcloud)
library(patchwork)



2 - TOPOLOGY ANALYSES:

Here we compare the topologies we found



import data pt. 1:

Here we import the different phylogenetic inferences along with the tips tratis:

concat_all_hom_aa <- read.tree("phylogenetic_resolutions/concatenation_all_genes_site_homogeneous_aa.nwk")
concat_all_het_aa <- read.tree("phylogenetic_resolutions/concatenation_all_genes_site_heterogeneous_aa.nwk")

concat_hom_hom_aa <- read.tree("phylogenetic_resolutions/concatenation_homogeneous_genes_site_homogeneous_aa.nwk")
concat_hom_het_aa <- read.tree("phylogenetic_resolutions/concatenation_homogeneous_genes_site_heterogeneous_aa.nwk")

concat_all_hom_SR4 <- read.tree("phylogenetic_resolutions/concatenation_all_genes_site_homogeneous_SR4.nwk")
concat_all_het_SR4 <- read.tree("phylogenetic_resolutions/concatenation_all_genes_site_heterogeneous_SR4.nwk")

concat_hom_hom_SR4 <- read.tree("phylogenetic_resolutions/concatenation_homogeneous_genes_site_homogeneous_SR4.nwk")
concat_hom_het_SR4 <- read.tree("phylogenetic_resolutions/concatenation_homogeneous_genes_site_heterogeneous_SR4.nwk")

genfam_all_aa <- read.tree("phylogenetic_resolutions/SpeciesRax_all_genes_aa.nwk")
genfam_hom_aa <- read.tree("phylogenetic_resolutions/SpeciesRax_homogeneous_genes_aa.nwk")
genfam_all_SR4 <- read.tree("phylogenetic_resolutions/SpeciesRax_all_genes_SR4.nwk")

genfam_all_SR4 <- drop.tip(genfam_all_SR4, "NZ-LG025083.1")

annot <- read.csv("sp_annotations.csv", header=T)
sp_traits <- read.csv("sp_traits.csv", header=T)

row.names(annot) <- annot$species
row.names(sp_traits) <- sp_traits$species

info <- merge(annot, sp_traits, by = "row.names")
row.names(info) <- info$Row.names

info <- info[,-c(1,2)]



Fig. S2 - visualizing differences among inferences:

A more detailed way to take a look at the differences among our inferences by plotting them side by side.

pdf(file="FigS2.pdf", width=25, height=25)

ref_tree <- genfam_all_SR4
ref_tree <- drop.tip(ref_tree,c("NZ-CP028926.1","NZ-CP009610.1"))                               # drop outgroups
ref_tree <- reroot(ref_tree,which((ref_tree$tip.label == "NZ-CP050969.1") == "TRUE"))           # reroot on Plesiomonas shigelloides
ref_tree$tip.label <- gsub("\\.[0-9]", "",ref_tree$tip.label )
ref_tree$edge.length<-NULL

other_tree <- concat_all_het_SR4
other_tree <- drop.tip(other_tree,c("NZ-CP028926.1","NZ-CP009610.1"))                           # drop outgroups
other_tree <- reroot(other_tree,which((other_tree$tip.label == "NZ-CP050969.1") == "TRUE"))     # reroot on Plesiomonas shigelloides
other_tree$tip.label <- gsub("\\.[0-9]", "",other_tree$tip.label )
other_tree$edge.length<-NULL

info <- merge(annot, sp_traits, by = "row.names")
row.names(info) <- info$Row.names

info <- info[match(ref_tree$tip.label, info$species.y),]
ref_tree$tip.label <- info$sp

info <- info[match(other_tree$tip.label, info$species.y),]
other_tree$tip.label <- info$sp

vs<-cophylo(ref_tree,other_tree,rotate=T)
## Rotating nodes to optimize matching...
## Done.
plot(vs,link.lwd=0.5,link.lty="solid",
     link.col=make.transparent("darkgray ",0.75),fsize=0.5)
title(main="gene families / all gene families / SR4 recoding (11) VS concatenation / all single copy genes / mixture model / SR4 recoding (04)", line = -1, cex.main = 2)

nodelabels.cophylo(round(as.numeric(ref_tree$node.label),2), cex=0.5, which="left", frame = "none")
nodelabels.cophylo(as.numeric(other_tree$node.label), cex=0.5, which="right", frame = "none")

info <- merge(annot, sp_traits, by = "row.names")
info <- info[match(ref_tree$tip.label, info$sp),]
state <- setNames(as.factor(info$state),info$sp)
cols<-setNames(c("orange","#c0ecf5"),levels(state))
tiplabels.cophylo(pie=to.matrix(state[vs$trees[[1]]$tip.label],levels(state)),
    piecol=cols,cex=0.1,which="left")
info <- merge(annot, sp_traits, by = "row.names")
info <- info[match(other_tree$tip.label, info$sp),]
state <- setNames(as.factor(info$state),info$sp)
tiplabels.cophylo(pie=to.matrix(state[vs$trees[[2]]$tip.label],levels(state)),
    piecol=cols,cex=0.1,which="right")


dev.off() 
## quartz_off_screen 
##                 2



export trees for subsequent analyses:
tree <- genfam_all_SR4
tree <- reroot(tree,which((tree$tip.label == "NZ-CP050969.1") == "TRUE"))             # reroot on Plesiomonas shigelloides
write.tree(tree, "./DLT_analyses_SpeciesRax_all_genes_SR4/DLT_analyses_SpeciesRax_all_genes_SR4_starting_tree.nwk")

tree <- concat_all_het_SR4
tree <- reroot(tree,which((tree$tip.label == "NZ-CP050969.1") == "TRUE"))             # reroot on Plesiomonas shigelloides
write.tree(tree, "./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4_starting_tree.nwk")



3 - ANALYSES ON “BEST” TOPOLOGY:



D/L/T of gene families on the “gene families + all genes + SR4” tree:
generax --families "$Family_file" --si-strategy HYBRID --species-tree MiniNJ --rec-model UndatedDTL --per-family-rates --prune-species-tree --si-estimate-bl --si-quartet-support --prefix "$out"
n=1; while read line; do if echo $line | grep -q Node; then ((n++)); ref=$(echo $line | awk '{print $1}'); echo $ref s$n; fi; done < counts.txt > conversion.txt
cp DLT_tree.newick DLT_tree_conv.newick; while read line; do one=$(echo $line | awk '{print $1}'); two=$(echo $line | awk '{print $2}'); gsed -i "s/$one/$two/g;" DLT_tree_conv.newick; done < conversion.txt
cp counts.txt counts_conv.txt; while read line; do one=$(echo $line | awk '{print $1}'); two=$(echo $line | awk '{print $2}'); gsed -i "s/$one/$two/g;" counts_conv.txt; done < conversion.txt 
cp rates.txt rates_conv.txt; while read line; do one=$(echo $line | awk '{print $1}'); two=$(echo $line | awk '{print $2}'); gsed -i "s/$one/$two/g;" rates_conv.txt; done < conversion.txt
for file in *speciesEventCounts.txt; do while read line; do one=$(echo $line | awk '{print $1}'); two=$(echo $line | awk '{print $2}'); gsed -i "s/$one/$two/g;" $file; done < conversion.txt; done



data wrangling pt. 1a:
node_annotation = read.table("DLT_analyses_SpeciesRax_all_genes_SR4/node_annotation.tsv",
                             sep="\t",header=T)   #Path to node annotation

rates = read.table("DLT_analyses_SpeciesRax_all_genes_SR4/rates_conv.txt",
                   sep=" ",header=F)    #Path to per_species_rates
colnames(rates) <- c("node","duplication_rate","loss_rate","transfer_rate")
counts = read.table("DLT_analyses_SpeciesRax_all_genes_SR4/counts_conv.txt",
                    sep=" ",header=F)  #Path to per_species_counts
colnames(counts) <- c("node","S","SL","D","T","TL")
counts$S <- gsub("S=", "", counts$S)
counts$SL <- gsub("SL=", "", counts$SL)
counts$D <- gsub("D=", "", counts$D)
counts$T <- gsub("T=", "", counts$T)
counts$TL <- gsub("TL=", "", counts$TL)

tmp  <- left_join(node_annotation, rates, by = 'node')
data  <- left_join(tmp, counts, by = 'node')



Fig. S3 - contingent losses across different endosymbiont lineages:
timeline <-  read.table("DLT_analyses_SpeciesRax_all_genes_SR4/lineages.tsv",sep="\t",header=T) 

timeline  <- left_join(timeline, data, by = 'node')

timeline <- timeline %>%
  group_by(coordinate, origin.x) %>% 
  summarise_at(vars("loss_rate"), mean)

tmp <- timeline %>%
  mutate(name2=origin.x)

FigS3 <- tmp %>%
  ggplot( aes(x=coordinate, y=loss_rate)) +
  geom_line( data=tmp %>% dplyr::select(-origin.x), aes(group=name2), color="lightgray", size=0.55, alpha=0.5) +
  geom_line( aes(color=origin.x), color="orange", size=1 ) + 
  theme(
      legend.position="none",
      plot.title = element_text(size=14),
      panel.grid = element_blank()
    ) + theme_bw() + facet_wrap(~origin.x, nrow=3) + geom_vline(xintercept = 0, linetype="dotted",size=.5) +
    theme(strip.background =element_rect(fill="white")) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) +
  labs(x = "branch") + labs(y = "mean loss rate")


ggsave("FigS3.pdf", FigS3, width = 6, height = 6)

FigS3



Fig. 2b - contingent gene losses overall:
timeline <-  read.table("DLT_analyses_SpeciesRax_all_genes_SR4/lineages.tsv",sep="\t",header=T) 

timeline  <- left_join(timeline, data, by = 'node')

timeline <- timeline %>%
  group_by(coordinate, origin.x) %>% 
  summarise_at(vars("loss_rate"), mean)

median(subset(timeline, coordinate == 0)$loss_rate)
## [1] 0.5924475
sd(subset(timeline, coordinate == 0)$loss_rate)
## [1] 0.2099138
median(subset(timeline, coordinate != 0)$loss_rate)
## [1] 0.1847425
sd(subset(timeline, coordinate != 0)$loss_rate)
## [1] 0.1231039
median(subset(timeline, coordinate == 0)$loss_rate)/median(subset(timeline, coordinate != 0)$loss_rate)
## [1] 3.206883
sd(subset(timeline, coordinate == 0)$loss_rate)/mean(subset(timeline, coordinate != 0)$loss_rate)
## [1] 1.238111
tmp <- timeline %>%
  mutate(name2=origin.x)

Fig2b <- ggplot(data=tmp, aes(x=coordinate, y=loss_rate, colour=origin)) + 
  xlab("") + theme_bw() + theme(text = element_text(size = 10)) + 
  geom_smooth(xintercept=-1, linetype="dashed", colour = "orange", size=1, alpha=0.2, colour=timeline$origin) +
  theme(legend.position="none", plot.title = element_text(size=14)) + 
  theme_bw() + theme(axis.title.x=element_blank()) + geom_vline(xintercept = 0, linetype="dotted",size=.5) +
  theme(strip.background = element_rect(fill="white")) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + labs(x = "branch") + labs(y = "gene loss rate") + labs(x = "branch")



Fig. 2a + 3a + 3c - DTL rates on the phylogeny:
tree <-  read.tree("DLT_analyses_SpeciesRax_all_genes_SR4/DLT_tree_conv.newick") #Path to specie tree with node labels

tree <- drop.tip(tree,c("NZ-CP028926.1","NZ-CP009610.1"))           # drop outgroups
data <- subset(data, node != "NZ-CP028926.1")
data <- subset(data, node != "NZ-CP009610.1")
data <- subset(data, node != "s209")
data <- subset(data, node != "s2")

#Create two dataframes for internal nodes and tips and order them based on the tree
data_node = data[grepl('^s|^Root', data$node),]
toDrop = rownames(data_node)
data_tip = data[!(row.names(data) %in% toDrop),]
data_node = data_node %>% arrange(factor(node, levels = tree$node.label))
data_tip = data_tip %>% arrange(factor(node, levels = tree$tip.label))
#Add node number
data_node$node = 1:tree$Nnode+Ntip(tree)
data_tip$node = 1:Ntip(tree)
#Merge the two dataframes
df = rbind(data_node,data_tip)
#Merge phylogenetic tree and data
tree2 <- full_join(tree, df, by = 'node')
#write out a nexus file with all annotations
write.beast(tree2,"DLT_analyses_SpeciesRax_all_genes_SR4/Enterobacteriales_annotated_tree.nxs")
#read the nexus file
beast_tree=read.beast("DLT_analyses_SpeciesRax_all_genes_SR4/Enterobacteriales_annotated_tree.nxs")

#Plot the loss tree
Fig2a <- ggtree(beast_tree, aes(color=as.numeric(loss_rate)), branch.length="none", size = 1, layout="fan") +
  scale_color_gradientn(colours=c("gray95", 'black')) +
  theme(legend.position = 'bottom') + ggtitle("loss rates") +
  geom_point2(aes(subset=(label=="s96")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s96")), label="A", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="AP012551.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="AP012551.1")), label="B", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCF-012026695.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCF-012026695.1")), label="C", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s156")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s156")), label="D", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s153")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s153")), label="E1", size=2, color='white', fontface = "bold") +
    geom_point2(aes(subset=(label=="NZ-CP006568.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="NZ-CP006568.1")), label="E2", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s142")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s142")), label="F", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s28")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s28")), label="G", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCA-001457715.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCA-001457715.1")), label="H", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s181")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s181")), label="I", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s192")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s192")), label="L", size=2, color='white', fontface = "bold")
 

#Plot the duplication tree
Fig3a <- ggtree(beast_tree, aes(color=as.numeric(duplication_rate)),branch.length="none", size = 1, layout="fan") +
  scale_color_gradientn(colours=c("gray95", 'black')) +
  theme(legend.position = 'bottom') + ggtitle("duplication rates") +
   geom_point2(aes(subset=(label=="s96")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s96")), label="A", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="AP012551.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="AP012551.1")), label="B", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCF-012026695.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCF-012026695.1")), label="C", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s156")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s156")), label="D", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s153")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s153")), label="E1", size=2, color='white', fontface = "bold") +
    geom_point2(aes(subset=(label=="NZ-CP006568.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="NZ-CP006568.1")), label="E2", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s142")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s142")), label="F", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s28")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s28")), label="G", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCA-001457715.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCA-001457715.1")), label="H", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s181")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s181")), label="I", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s192")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s192")), label="L", size=2, color='white', fontface = "bold")

#Plot the transfer tree
Fig3c <- ggtree(beast_tree, aes(color=as.numeric(transfer_rate)),branch.length="none", size = 1, layout="fan") +
  scale_color_gradientn(colours=c("gray95", 'black')) +
  theme(legend.position = 'bottom') + ggtitle("transfer rates") +
  geom_point2(aes(subset=(label=="s96")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s96")), label="A", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="AP012551.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="AP012551.1")), label="B", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCF-012026695.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCF-012026695.1")), label="C", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s156")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s156")), label="D", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s153")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s153")), label="E1", size=2, color='white', fontface = "bold") +
    geom_point2(aes(subset=(label=="NZ-CP006568.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="NZ-CP006568.1")), label="E2", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s142")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s142")), label="F", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s28")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s28")), label="G", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCA-001457715.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCA-001457715.1")), label="H", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s181")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s181")), label="I", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s192")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s192")), label="L", size=2, color='white', fontface = "bold")



Fig. 2c + 3b + 3d - DTL rates differences between symbiont and free-living branches:

NB: endosymbiosis-enstablishment branches are not included here.

all_branches_events <- data %>% filter(! str_detect(state, "EST"))

all_branches_events$loss_rate <- as.numeric(all_branches_events$loss_rate)
all_branches_events$duplication_rate <- as.numeric(all_branches_events$duplication_rate)
all_branches_events$transfer_rate <- as.numeric(all_branches_events$transfer_rate)

state <- ""
levels(state) <- c("endosymbionts","free-living")
cols<-setNames(c("orange","#c0ecf5"),levels(state))

all_branches_events$state[all_branches_events$state == "E"] <- "endosymbionts"
all_branches_events$state[all_branches_events$state == "N"] <- "free-living"

shapiro.test(all_branches_events$loss_rate)
## 
##  Shapiro-Wilk normality test
## 
## data:  all_branches_events$loss_rate
## W = 0.85423, p-value < 2.2e-16
p <- ks.test(all_branches_events$loss_rate[all_branches_events$state == "endosymbionts"],all_branches_events$loss_rate[all_branches_events$state == "free-living"],alternative="less")$p.value
Fig2c <- ggplot(all_branches_events, aes(y=loss_rate, x=state, fill=state)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) + ggtitle(paste("loss rates p=",round(p,4))) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(axis.title.x = element_blank()) + labs(y = "gene loss rates") + theme(plot.title = element_text(size = 10)) +
  coord_cartesian(ylim=c(0, .5))

median(subset(all_branches_events, state == "free-living")$loss_rate)
## [1] 0.08906095
sd(subset(all_branches_events, state == "free-living")$loss_rate)
## [1] 0.102837
median(subset(all_branches_events, state != "free-living")$loss_rate)
## [1] 0.131917
sd(subset(all_branches_events, state != "free-living")$loss_rate)
## [1] 0.1275177
shapiro.test(all_branches_events$duplication_rate)
## 
##  Shapiro-Wilk normality test
## 
## data:  all_branches_events$duplication_rate
## W = 0.25542, p-value < 2.2e-16
p <- ks.test(all_branches_events$duplication_rate[all_branches_events$state == "endosymbionts"],all_branches_events$duplication_rate[all_branches_events$state == "free-living"],alternative="greater")$p.value
Fig3b <- ggplot(all_branches_events, aes(y=duplication_rate, x=state, fill=state)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) + ggtitle(paste("duplication rates p=",round(p,4))) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(axis.title.x = element_blank()) + labs(y = "gene duplication rates") + theme(plot.title = element_text(size = 10)) + coord_cartesian(ylim=c(0, .05))

median(subset(all_branches_events, state == "free-living")$duplication_rate)
## [1] 0.003766365
sd(subset(all_branches_events, state == "free-living")$duplication_rate)
## [1] 0.1346826
median(subset(all_branches_events, state != "free-living")$duplication_rate)
## [1] 1e-07
sd(subset(all_branches_events, state != "free-living")$duplication_rate)
## [1] 0.1505295
shapiro.test(all_branches_events$transfer_rate)
## 
##  Shapiro-Wilk normality test
## 
## data:  all_branches_events$transfer_rate
## W = 0.84092, p-value < 2.2e-16
p <- ks.test(all_branches_events$transfer_rate[all_branches_events$state == "endosymbionts"],all_branches_events$transfer_rate[all_branches_events$state == "free-living"],alternative="greater")$p.value
Fig3d <- ggplot(all_branches_events, aes(y=transfer_rate, x=state, fill=state)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) + ggtitle(paste("transfer rates p= ",round(p,4))) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(axis.title.x = element_blank()) + labs(y = "gene transfer rates") + theme(plot.title = element_text(size = 10)) +
  coord_cartesian(ylim=c(0, .5))

median(subset(all_branches_events, state == "free-living")$transfer_rate)
## [1] 0.09097955
sd(subset(all_branches_events, state == "free-living")$transfer_rate)
## [1] 0.1283512
median(subset(all_branches_events, state != "free-living")$transfer_rate)
## [1] 0.0817777
sd(subset(all_branches_events, state != "free-living")$transfer_rate)
## [1] 0.050599



wrapping up Fig. 2 - losses:
Fig2tmp <- plot_grid(Fig2b,Fig2c,
                     ncol = 1,
                     axis = "lr", 
                     align = "hv",
                     labels = c("B","C"))

Fig2 <- plot_grid(Fig2a,Fig2tmp,
                     ncol = 2,
                     axis = "tr", 
                     labels = c("A",""))

ggsave("Fig2.pdf", Fig2, width = 12, height = 8)

Fig2



wrapping up Fig. 3 - transfer and duplication:
Fig3tmp1 <- plot_grid(Fig3a,Fig3c,
                     ncol = 1,
                     axis = "lr", 
                     align = "hv",
                     labels = c("A","C"))

Fig3tmp2 <- plot_grid(Fig3b,Fig3d,
                     ncol = 1,
                     axis = "lr", 
                     align = "hv",
                     labels = c("B","D"))

Fig3 <- plot_grid(Fig3tmp1,Fig3tmp2,
                  ncol = 2,
                  axis = "lr", 
                  align = "hv")

ggsave("Fig3.pdf", Fig3, width = 12, height = 12)

Fig3



Fig. S4 - DTL disparity across different endosymbiont lineages and association strength:

NB: endosymbiosis-enstablishment branches are not included here.

all_branches_events <- data %>% filter(! str_detect(state, "EST"))

all_branches_events$loss_rate <- as.numeric(all_branches_events$loss_rate)
all_branches_events$duplication_rate <- as.numeric(all_branches_events$duplication_rate)
all_branches_events$transfer_rate <- as.numeric(all_branches_events$transfer_rate)

all_branches_events$strength[all_branches_events$strength == "0"] <- "free-living"
all_branches_events$strength[all_branches_events$strength == "1"] <- "strict"
all_branches_events$strength[all_branches_events$strength == "2"] <- "loose"
all_branches_events$strength[all_branches_events$strength == "1/2"] <- "intermediate"

levels(state) <- c("free-living",
                   "strict",
                   "loose",
                   "intermediate")

cols<-setNames(c("#c0ecf5","orange","orange","orange"),levels(state))

median(subset(all_branches_events, strength == "free-living")$loss_rate)
## [1] 0.0883924
sd(subset(all_branches_events, strength == "free-living")$loss_rate)
## [1] 0.1119016
median(subset(all_branches_events, strength == "strict")$loss_rate)
## [1] 0.1550725
sd(subset(all_branches_events, strength == "strict")$loss_rate)
## [1] 0.1208997
median(subset(all_branches_events, strength == "intermediate")$loss_rate)
## [1] 0.120976
sd(subset(all_branches_events, strength == "intermediate")$loss_rate)
## [1] 0.05203798
median(subset(all_branches_events, strength == "loose")$loss_rate)
## [1] 0.160289
sd(subset(all_branches_events, strength == "loose")$loss_rate)
## [1] 0.203509
FigS4a <- ggplot(all_branches_events, aes(y=loss_rate, x=strength, fill=strength)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(aspect.ratio=1) + 
  theme(axis.title.x = element_blank()) + labs(y = "gene loss rates") + theme(plot.title = element_text(size = 10)) + 
  stat_compare_means(aes(label = after_stat(p.signif)), ref.group = "free-living", method = "wilcox.test", hjust = 2) +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  geom_hline(yintercept=median(all_branches_events$loss_rate[all_branches_events$strength == "free-living"]), linetype="dashed", color = "black", size=0.5) + coord_flip() 

median(subset(all_branches_events, strength == "free-living")$transfer_rate)
## [1] 0.0890355
sd(subset(all_branches_events, strength == "free-living")$transfer_rate)
## [1] 0.127596
median(subset(all_branches_events, strength == "strict")$transfer_rate)
## [1] 0.0949539
sd(subset(all_branches_events, strength == "strict")$transfer_rate)
## [1] 0.05277865
median(subset(all_branches_events, strength == "intermediate")$transfer_rate)
## [1] 0.0844569
sd(subset(all_branches_events, strength == "intermediate")$transfer_rate)
## [1] 0.03892602
median(subset(all_branches_events, strength == "loose")$transfer_rate)
## [1] 0.0451819
sd(subset(all_branches_events, strength == "loose")$transfer_rate)
## [1] 0.06546718
FigS4b <- ggplot(all_branches_events, aes(y=transfer_rate, x=strength, fill=strength)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(aspect.ratio=1) + 
  theme(axis.title.x = element_blank()) + labs(y = "gene transfer rates") + theme(plot.title = element_text(size = 10)) + 
  stat_compare_means(aes(label = after_stat(p.signif)), ref.group = "free-living", method = "wilcox.test", hjust = 2) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  geom_hline(yintercept=median(all_branches_events$transfer_rate[all_branches_events$strength == "free-living"]), linetype="dashed", color = "black", size=0.5) + coord_flip()

median(subset(all_branches_events, strength == "free-living")$duplication_rate)
## [1] 0.0037185
sd(subset(all_branches_events, strength == "free-living")$duplication_rate)
## [1] 0.1326349
median(subset(all_branches_events, strength == "strict")$duplication_rate)
## [1] 1e-07
sd(subset(all_branches_events, strength == "strict")$duplication_rate)
## [1] 0.00889187
median(subset(all_branches_events, strength == "intermediate")$duplication_rate)
## [1] 1e-07
sd(subset(all_branches_events, strength == "intermediate")$duplication_rate)
## [1] 0.1874086
median(subset(all_branches_events, strength == "loose")$duplication_rate)
## [1] 0.02718265
sd(subset(all_branches_events, strength == "loose")$duplication_rate)
## [1] 0.1967789
FigS4c <- ggplot(all_branches_events, aes(y=duplication_rate, x=strength, fill=strength)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(aspect.ratio=1) + 
  theme(axis.title.x = element_blank()) + labs(y = "gene duplication rates (square root transformed)") + theme(plot.title = element_text(size = 10)) + scale_y_continuous(trans='sqrt') + 
  stat_compare_means(aes(label = after_stat(p.signif)), ref.group = "free-living", method = "wilcox.test", hjust = 2) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  geom_hline(yintercept=median(all_branches_events$duplication_rate[all_branches_events$strength == "free-living"]), linetype="dashed", color = "black", size=0.5) + coord_flip()
all_branches_events <- data %>% filter(! str_detect(node, "root"))

all_branches_events["origin"][all_branches_events["origin"] == "EST"] <- "E"
all_branches_events["state"][all_branches_events["state"] == "EST"] <- "E"
all_branches_events <- drop_na(all_branches_events)

FigS4d <- ggplot(all_branches_events, aes(y=loss_rate, x=origin, fill=state)) + 
  geom_boxplot(adjust=1.5, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = c("orange","#c0ecf5")) +
  theme(legend.position = "none") + 
  theme(aspect.ratio=1) + 
  stat_compare_means(aes(label = after_stat(p.signif)), ref.group = "0", method = "wilcox.test", hjust = 2) + 
  geom_hline(yintercept=median(all_branches_events$loss_rate[all_branches_events$origin == "0"]), linetype="dashed", color = "black", size=0.5) + coord_flip() +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  labs(y = "gene loss rate")

FigS4e <- ggplot(all_branches_events, aes(y=transfer_rate, x=origin, fill=state)) + 
  geom_boxplot(adjust=1.5, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = c("orange","#c0ecf5")) +
  theme(legend.position = "none") + 
  theme(aspect.ratio=1) + 
  stat_compare_means(aes(label = after_stat(p.signif)), ref.group = "0", method = "wilcox.test", hjust = 2) + 
  geom_hline(yintercept=median(all_branches_events$transfer_rate[all_branches_events$origin == "0"]), linetype="dashed", color = "black", size=0.5) + coord_flip() +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  labs(y = "gene transfer rate")

FigS4f <- ggplot(all_branches_events, aes(y=sqrt(duplication_rate), x=origin, fill=state)) + 
  geom_boxplot(adjust=1.5, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = c("orange","#c0ecf5")) +
  theme(legend.position = "none") + 
  theme(aspect.ratio=1) + 
  stat_compare_means(aes(label = after_stat(p.signif)), ref.group = "0", method = "wilcox.test", hjust = 2) + 
  geom_hline(yintercept=sqrt(median(all_branches_events$duplication_rate[all_branches_events$origin == "0"])), linetype="dashed", color = "black", size=0.5) + coord_flip() +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  labs(y = "gene duplication rate (square root transformed)")
FigS4 <- plot_grid(FigS4a,
                   FigS4b,
                   FigS4c,
                   FigS4d,
                   FigS4e,
                   FigS4f,
                   nrow = 2,
                   axis = "lr", 
                   align = "hv",
                   labels = c("A","B","C","D","E","F"))

ggsave("FigS4.pdf", FigS4, width = 18, height = 12)

FigS4



data wrangling pt. 2a:
tokeep <- read.table("./DLT_analyses_SpeciesRax_all_genes_SR4/node_annotation.tsv",sep="\t",header=T)  

files <- list.files(pattern=".txt", path="./DLT_analyses_SpeciesRax_all_genes_SR4/PerFamily_counts/",full.names = T)

ancs <- vector()  # genes present in Enterobacterales ancestor 
endo <- vector()  # genes present in at least an endosymbiont clade MRCA
aftr <- vector()  # genes originated or acquired in Enterobacterales after any shift to endosymbiosis  
nevr <- vector()  # genes which where originated or acquired in free-living only

ubiq <- vector()  # genes present in all Enterobacterales spp.
up50 <- vector()  # genes present in more than 50% of Endosmybionts spp.
dn50 <- vector()  # genes present in less than 50% of Endosmybionts spp.
lost <- vector()  # genes absent in Endosmybionts spp.

j=1

  for (i in files) {
    #i="./DLT_analyses_SpeciesRax_all_genes_SR4/PerFamily_counts//OG???_ol_fa_mafft_gappyout_NoSp_Renamed_ACGT_speciesEventCounts.txt"
    df2 <- read.table(i, sep = " ", header = T)
    df2[] <- lapply(df2, gsub, pattern=',', replacement='')
    OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_ACGT_speciesEventCounts.txt","",i)
    OG <- gsub("./DLT_analyses_SpeciesRax_all_genes_SR4/PerFamily_counts//","",OG)
    names(df2)=c("node","S","D","L","T","P","O")
    df2 <- merge(df2,tokeep, by="node")
    if (sum(length(df2[df2$node == "s3",]$P)) > 0) {
      ancs[[j]] <- OG
    }
    if (sum(as.numeric(df2[df2$state == "EST",]$L)) > 0) {
      endo[[j]] <- OG
      df2 <- df2[with(df2, ! grepl("^s", node)),]
      df2 <- subset(df2, state == "E" |  state == "EST")
      df2 <- subset(df2, P == 1)
      if (length(df2$node) == 75) {
        ubiq[[j]] <- OG
        } else if (length(df2$node) > 36) {
          up50[[j]] <- OG
          } else if (length(df2$node) > 0) {
            dn50[[j]] <- OG
            } else if (length(df2$node) == 0) {
              lost[[j]] <- OG
           }
      } else {
        df2 <- df2[with(df2, ! grepl("^s", node)),]
        if (sum(length(df2[df2$state == "E" | df2$state == "EST",]$P)) > 0) {
          aftr[[j]] <- OG
        } else {
          nevr[[j]] <- OG
    }
      }
    j = j + 1
  }

length(files)         # Among all genes
## [1] 10152
length(na.omit(endo)) # genes present in at least one free-livinig MRCA of an endosymbiotic clade
## [1] 4241
length(na.omit(aftr)) # genes originated or acquired in endosmybionts after any shift to endosymbiosis  
## [1] 1994
length(na.omit(nevr)) # genes which where originated or acquired in free-living only
## [1] 3917
# Among genes which were present in at least one free-living MRCA of an endosymbiotic clade
length(na.omit(ubiq)) # genes present in all endosmybiont spp.
## [1] 39
length(na.omit(up50)) # genes present in more than 50% of endosmybionts spp.
## [1] 413
length(na.omit(dn50)) # genes present in less than 50% of endosmybionts spp.
## [1] 2684
length(na.omit(lost)) # genes lost in all endosmybionts spp.
## [1] 1105



data wrangling pt. 3a:
files <- list.files(pattern=".txt", path="./DLT_analyses_SpeciesRax_all_genes_SR4/PerFamily_counts/",full.names = T)

tokeep <- read.table("./DLT_analyses_SpeciesRax_all_genes_SR4/node_annotation.tsv",sep="\t",header=T)  

names(tokeep)=c("Node","Trait","Origin")

tokeep$Origin[tokeep$Trait != "N"] <- "endosymbiont"
tokeep$Origin[tokeep$Trait == "N"] <- "free-living"

for (k in c("free-living","endosymbiont")){

  df=data.frame(OG=character(),mean=numeric())
  ortogruppo = vector()
  media_OG = vector()
  j = 1
  for (i in files) {
    df2 <- read.table(i, sep = " ", header = T)
    df2[] <- lapply(df2, gsub, pattern=',', replacement='')
    df2$speciations. <- as.numeric(df2$speciations.)
    df2$duplications. <- as.numeric(df2$duplications.)
    df2$losses. <- as.numeric(df2$losses.)
    df2$transfers. <- as.numeric(df2$transfers.)
    OG <- gsub("./PerFamily_counts//","",i)
    OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_speciesEventCounts.txt","",OG)
    names(df2)=c("Node","S","D","L","T","P","O")
    df2 <- merge(df2,tokeep, by="Node")
    df2 <- subset(df2, Origin == k)
    if (length(df2$Trait) >= 1) {
      df2$freq <- (df2$L / (df2$S + df2$L))
      df2 <- na.omit(df2)
      mean_OG=mean(df2$freq)
      #print(mean_OG)
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- mean_OG
    } else {
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- "NA"
    }
      j = j + 1
  }

  df=data.frame(OG=ortogruppo,mean_freq=media_OG)
  df$OG <- gsub("./DLT_analyses_SpeciesRax_all_genes_SR","",df$OG)
  df$origin <- k
  assign(paste("df", k, sep="_"), df)
}

df <- rbind(`df_free-living`,df_endosymbiont)
df$OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_ACGT_speciesEventCounts.txt","",df$OG)
df$mean_freq <- as.numeric(df$mean_freq)



Fig. S5 - families mean loss frequency delta across functional categories:
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)
clean <- merge(df,dnds, by="OG")
clean<-subset(clean, free!="NA")
clean<-subset(clean, mean_freq!="NA")
clean<-subset(clean, mean_freq!="NaN")
clean<-subset(clean, free < 1)
clean<-subset(clean, endo < 1)
cog <- read.table("cog.tsv", sep="\t", header = T)
clean <- left_join(clean, cog, by = "OG")
clean <- clean %>% separate_rows(cog, sep = '(?<=.)(?=.)')
cog_conversion <- read.table("cog_conversion.tsv", sep="\t", header = T)
clean <- left_join(clean, cog_conversion, by = "cog")

clean <- subset(clean, cog != "A")
clean <- subset(clean, cog != "B")
clean <- subset(clean, cog != "W")
clean <- subset(clean, cog != "Z")
clean <- subset(clean, cog != "na")
clean <- subset(clean, cog != "-")

clean <- clean[clean$OG %in% endo,]  

delta <- clean %>% group_by(cog, origin) %>% summarise(median_freq=median(mean_freq))
delta <- as.data.frame(na.omit(delta))
delta <- reshape(delta, idvar = "cog", timevar = "origin", direction = "wide")
delta$delta <- (delta$median_freq.endosymbiont - delta$`median_freq.free-living`)
delta <- left_join(delta, cog_conversion, by = "cog")
delta <- delta %>% arrange(delta)
delta <- delta[order(delta$delta, decreasing = TRUE), ]

delta
##    cog median_freq.endosymbiont median_freq.free-living      delta
## 19   Q                0.5000000              0.08333333 0.41666667
## 18   I                0.3969697              0.06189046 0.33507924
## 17   P                0.3964286              0.06443032 0.33199825
## 16   T                0.3850160              0.06577968 0.31923634
## 15   G                0.4045455              0.08731323 0.31723222
## 14   K                0.3783784              0.08661417 0.29176421
## 13   S                0.3571429              0.06593674 0.29120612
## 12   E                0.3333333              0.06185567 0.27147766
## 11   V                0.3284314              0.06957995 0.25885143
## 10   C                0.2817460              0.05635889 0.22538714
## 9    M                0.2714988              0.04919355 0.22230522
## 8    U                0.2752525              0.05719639 0.21805614
## 7    N                0.2944820              0.08417269 0.21030929
## 6    L                0.2353267              0.03901555 0.19631114
## 5    H                0.2307692              0.03922862 0.19154061
## 4    O                0.2385013              0.05584768 0.18265362
## 3    F                0.2183099              0.04562044 0.17268942
## 2    D                0.1760155              0.03419347 0.14182200
## 1    J                0.1127098              0.05000000 0.06270983
##                                                      description
## 19  Secondary metabolites biosynthesis, transport and catabolism
## 18                                Lipid transport and metabolism
## 17                        Inorganic ion transport and metabolism
## 16                                Signal transduction mechanisms
## 15                         Carbohydrate transport and metabolism
## 14                                                 Transcription
## 13                                              Function unknown
## 12                           Amino acid transport and metabolism
## 11                                            Defense mechanisms
## 10                              Energy production and conversion
## 9                         Cell wall/membrane/envelope biogenesis
## 8  Intracellular trafficking, secretion, and vesicular transport
## 7                                                  Cell motility
## 6                          Replication, recombination and repair
## 5                              Coenzyme transport and metabolism
## 4   Posttranslational modification, protein turnover, chaperones
## 3                            Nucleotide transport and metabolism
## 2     Cell cycle control, cell division, chromosome partitioning
## 1                Translation, ribosomal structure and biogenesis
order <- delta$description

FigS5 <- ggplot(data=clean, aes(x = factor(description, order), y=mean_freq, fill=origin)) + 
  geom_boxplot(outlier.shape = NA, position = position_dodge(1) ) +
  coord_flip() + theme_bw() + ylim(0, 1.2) +
  stat_compare_means(aes(label = after_stat(p.signif)), size=5, label.y = 0.75, label.x = 10, paired = T) +
  scale_fill_manual(values=c("orange", "#c0ecf5")) + 
  xlab("") + theme(legend.position = "none") +
  theme(axis.text.y = element_text(size = 14)) + theme(axis.text.x = element_text(size = 14)) +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  labs(y = "mean loss frequency") + ylim(0,1)

ggsave("FigS5.pdf", FigS5, width = 12, height = 12)

FigS5



Fig. 4c - families mean loss frequency correlate with free-living selection regime:
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)

clean <- merge(df,dnds, by = "OG")

colnames(clean) <- c("OG","mean_loss_freq","origin","dnds_free","dnds_endo")
clean <- spread(clean, key = origin, value = mean_loss_freq)
colnames(clean) <- c("OG","dnds_free","dnds_endo","mean_loss_freq_endo","mean_loss_freq_free")

clean <- clean[c("OG","dnds_free","mean_loss_freq_endo")]

clean <- subset(clean, dnds_free != "NA")
clean <- subset(clean, dnds_free < 1)

clean <- clean[clean$OG %in% endo,]  

dim(clean) # number of OGs considered in this analysis
## [1] 4143    3
Fig4c <- ggplot(clean, aes(x=dnds_free, y=mean_loss_freq_endo)) + 
  geom_point(size = 4, alpha =.05, pch = 20, color = "orange") + 
  geom_smooth(method = "lm", se = F, color="black", fullrange = TRUE, linetype = "dashed") +
  scale_linetype_manual(values ="dashed") +
  theme_bw() +
  xlab("dNdS in free-living") + 
  ylab("mean loss frequency in endosymbionts") +
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(strip.text = element_text(face = "bold", size = rel(0.45)), strip.background = element_rect(fill = "white", colour = "black", size = 1)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + theme(plot.title = element_text(size = 10)) + xlim(0,1) + ylim(0,1) +
  theme(legend.position = "none") + xlim(0,.75) +
  theme(axis.text.y = element_text(size = 14)) + theme(axis.text.x = element_text(size = 14)) + theme(aspect.ratio = 1)
  

Fig4c <- Fig4c + stat_cor(method = "spearman", cex = 5)

ggsave("Fig4c.pdf", Fig4c, width = 7, height = 7)

Fig4c



Fig. S8a - families mean loss frequency and free-living selective regime - boxplot:
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)

clean <- merge(df,dnds, by = "OG")

colnames(clean) <- c("OG","mean_loss_freq","origin","dnds_free","dnds_endo")
clean <- spread(clean, key = origin, value = mean_loss_freq)
colnames(clean) <- c("OG","dnds_free","dnds_endo","mean_loss_freq_endo","mean_loss_freq_free")

clean <- clean[c("OG","dnds_free","mean_loss_freq_endo")]

clean <- subset(clean, dnds_free != "NA")
clean <- subset(clean, dnds_free < 1)

clean['state'] <- "NA"

clean[clean$OG %in% ubiq,]$state <- "ubiquitous" 
clean[clean$OG %in% up50,]$state <- "more than 50% spp." 
clean[clean$OG %in% dn50,]$state <- "less than 50% spp." 
clean[clean$OG %in% lost,]$state <- "allways lost" 

clean <- subset(clean, state != "NA")

my_comparisons <- list( 
  c("ubiquitous", "more than 50% spp."), c("ubiquitous", "less than 50% spp."), c("ubiquitous", "allways lost"),
  c("allways lost", "more than 50% spp."), c("allways lost", "less than 50% spp."), 
  c("more than 50% spp.", "less than 50% spp.")
  )

clean <-subset(clean, dnds_free < 1)

FigS8a <- ggplot(clean, aes(y=dnds_free, x=state)) + 
  geom_boxplot(width=0.2, outlier.shape = NA, fill="orange") + 
  theme_bw() +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + 
  labs(y = "dNdS in free-living species") + labs(x = "gene present in at least one free-living MRCA of an endosymbiotic clade") + 
  stat_compare_means(comparisons = my_comparisons, method="wilcox.test") 



Fig. S7 - selection regime shift is consistent across different functional categories:
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)
clean <- merge(df,dnds, by="OG")
clean<-subset(clean, free!="NA")
clean<-subset(clean, endo!="NA")
clean<-subset(clean, mean_freq!="NA")
clean<-subset(clean, mean_freq!="NaN")
clean<-subset(clean, free < 1)
clean<-subset(clean, endo < 1)

clean$delta <- clean$endo - clean$free
clean <- subset(clean, delta!="NA")

cog <- read.table("cog.tsv", sep="\t", header = T)
cog_clean <- left_join(clean, cog, by = "OG")
cog_clean <- cog_clean %>% mutate(PCT = ntile(delta, 1000))  %>%  data.frame()  

cog_clean <- cog_clean %>%separate_rows(cog, sep = '(?<=.)(?=.)')

cog_clean <- subset(cog_clean, cog != "S")
cog_clean <- subset(cog_clean, cog != "NA")
cog_clean <- subset(cog_clean, cog != "-")
cog_clean <- subset(cog_clean, cog != "B")  # too few genes for the cog "Chromatin structure and dynamics" just has one genes
cog_clean <- subset(cog_clean, cog != "W")  # too few genes for the cog "Extracellular structures"

cog_conversion <- read.table("cog_conversion.tsv", sep="\t", header = T)
cog_clean <- left_join(cog_clean, cog_conversion, by = "cog")

median(cog_clean$delta)
## [1] 0.07677
sd(cog_clean$delta)
## [1] 0.08757883
FigS7 <- cog_clean %>% 
  group_by(description) %>%
  mutate(mdelta = median(delta)) %>%
  subset(origin == "free-living")  %>%
  ggplot(aes(delta)) +
  geom_density(aes(y=..count..)) +
  theme_bw() + 
  geom_vline(aes(xintercept= 0), colour='black', linetype="dashed") +
  geom_vline(aes(xintercept= mdelta), colour='orange', size=0.5) +
  facet_wrap(~ description, nrow=3) +
  theme(strip.text = element_text(face = "bold", size = rel(.4)), strip.background = element_rect(fill = "white", colour = "black", size = 1)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + labs(x = "delta dNdS" ) + labs(y = "number of genes")


ggsave("FigS7.pdf", FigS7, width = 12, height = 6)

FigS7



data wrangling pt. 4a:
files <- list.files(pattern=".txt", path="./DLT_analyses_SpeciesRax_all_genes_SR4/PerFamily_counts/",full.names = T)

tokeep <- read.table("./DLT_analyses_SpeciesRax_all_genes_SR4/node_annotation.tsv",sep="\t",header=T)  

names(tokeep)=c("Node","Trait","Origin")

for (k in c("A","B","C","D","E1","E2","F","G","H","I","L","0")){
  df=data.frame(OG=character(),mean=numeric())
  ortogruppo = vector()
  media_OG = vector()
  j = 1
  for (i in files) {
    df2 <- read.table(i, sep = " ", header = T)
    df2[] <- lapply(df2, gsub, pattern=',', replacement='')
    df2$speciations. <- as.numeric(df2$speciations.)
    df2$duplications. <- as.numeric(df2$duplications.)
    df2$losses. <- as.numeric(df2$losses.)
    df2$transfers. <- as.numeric(df2$transfers.)
    OG <- gsub("./PerFamily_counts//","",i)
    OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_speciesEventCounts.txt","",OG)
    names(df2)=c("Node","S","D","L","T","P","O")
    df2 <- merge(df2,tokeep, by="Node")
    df2 <- subset(df2, Origin == k)
    if (length(df2$Trait) >= 1) {
      df2$freq <- (df2$L / (df2$S + df2$L))
      df2 <- na.omit(df2)
      mean_OG=mean(df2$freq)
      #print(mean_OG)
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- mean_OG
    } else {
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- "NA"
    }
    j = j + 1
  }

  df=data.frame(OG=ortogruppo,mean_freq=media_OG)
  df$OG <- gsub("./DLT_analyses_SpeciesRax_all_genes_SR","",df$OG)
  df$origin <- k
  assign(paste("df", k, sep="_"), df)
}

df <- rbind(df_A,df_B,df_C,df_D,df_E1,df_E2,df_F,df_G,df_H,df_I,df_L,df_0)
df$OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_ACGT_speciesEventCounts.txt","",df$OG)



Fig. S6 - functional categories ranking for families mean loss frequency:
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)
clean <- merge(df,dnds, by="OG")
clean <- subset(clean, dnds != "NA")
clean <- subset(clean, mean_freq != "NA")
clean <- subset(clean, mean_freq != "NaN")
clean <- subset(clean, free != "NaN")
clean <- subset(clean, endo != "NaN")

cog <- read.table("cog.tsv", sep="\t", header = T)
clean_endo <- left_join(clean, cog, by = "OG")

clean_endo <- subset(clean_endo, cog != "NA")
clean_endo <- subset(clean_endo, cog != "-")

clean_endo <- clean_endo %>%separate_rows(cog, sep = '(?<=.)(?=.)')

clean_endo <- subset(clean_endo, cog != "A")
clean_endo <- subset(clean_endo, cog != "B")
clean_endo <- subset(clean_endo, cog != "W")
clean_endo <- subset(clean_endo, cog != "Z")

clean_endo <- left_join(clean_endo, cog_conversion, by = "cog")

clean_endo$mean_freq <- as.numeric(clean_endo$mean_freq)

clean_endo <- clean_endo[clean_endo$OG %in% endo,]  

A <- subset(clean_endo, origin == "A" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
A$rank <- rownames(A)
A$origin <- "A"

B <- subset(clean_endo, origin == "B" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
B$rank <- rownames(B)
B$origin <- "B"

C <- subset(clean_endo, origin == "C" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
C$rank <- rownames(C)
C$origin <- "C"

D <- subset(clean_endo, origin == "D" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
D$rank <- rownames(D)
D$origin <- "D"

E1 <- subset(clean_endo, origin == "E1" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
E1$rank <- rownames(E1)
E1$origin <- "E1"

E2 <- subset(clean_endo, origin == "E2" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
E2$rank <- rownames(E2)
E2$origin <- "E2"

effe <- subset(clean_endo, origin == "F" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
effe$rank <- rownames(effe)
effe$origin <- "effe"

G <- subset(clean_endo, origin == "G" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
G$rank <- rownames(G)
G$origin <- "G"

H <- subset(clean_endo, origin == "H" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
H$rank <- rownames(H)
H$origin <- "H"

I <- subset(clean_endo, origin == "I" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
I$rank <- rownames(I)
I$origin <- "I"

L <- subset(clean_endo, origin == "L" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
L$rank <- rownames(L)
L$origin <- "L"

zero <- subset(clean_endo, origin == "0" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
zero$rank <- rownames(zero)
zero$origin <- "zero"

loss_freq <- dplyr::bind_rows(A,B,C,D,E1,E2,effe,G,H,I,L,zero)
loss_freq$freq <- as.numeric(loss_freq$freq)
loss_freq$rank <- as.numeric(loss_freq$rank)

loss_freq$origin[loss_freq$origin == "effe"] <- "F"
loss_freq$origin[loss_freq$origin == "zero"] <- "free"

loss_freq <- loss_freq %>%  mutate(mycolor = ifelse(origin == "free", "#c0ecf5", "orange"))

order <- aggregate(rank~description,loss_freq,sum)
order <- order[order(order$rank),]$description

FigS6 <- ggplot(loss_freq, aes(x = factor(description, order), y=(as.numeric(rank)), label=origin, fill=mycolor)) + 
  geom_point() + 
  geom_label_repel(max.overlaps = Inf, min.segment.length = 0) + scale_fill_identity(c("#c0ecf5", "orange")) +
  theme_bw() + coord_flip() +
  xlab("") + ylab("") + theme(legend.position = "none") +
  theme(axis.text.y = element_text(size = 12)) + theme(axis.text.x = element_text(size = 12)) +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  labs(y = "mean loss frequency ranks")

ggsave("FigS6.pdf", FigS6, width = 12, height = 12)

FigS6



Fig. 4b - functional categories ranking for families mean loss frequency - wordcloud:
loss_freq_endo <- subset(loss_freq, mycolor == 'orange')[,c(1,3,4)]
loss_freq_endo <- aggregate(rank~description,loss_freq_endo,sum)

loss_freq_endo$description[loss_freq_endo$description == "Secondary metabolites biosynthesis, transport and catabolism"] <- "Secondary metabolites"
loss_freq_endo$description[loss_freq_endo$description == "Intracellular trafficking, secretion, and vesicular transport"] <- "Intracellular trafficking"
loss_freq_endo$description[loss_freq_endo$description == "Secondary metabolites biosynthesis, transport and catabolism"] <- "Secondary metabolites"

set.seed(sample(1:100,1))

Fig4b <- ggplot(loss_freq_endo, aes(label = description, size = rank, color = rank)) +
  geom_text_wordcloud(shape = "square", grid_size = 0, rm_outside = F, eccentricity = 2) +
  scale_size_area(max_size = 10) +
  theme_void() +
  scale_color_gradient(low = "darkgray", high = "orange") +
  theme(aspect.ratio = 1)

ggsave("Fig4b.pdf", Fig4b, width = 12, height = 12)

Fig4b



Fig. 4a - families mean loss frequency correlation across different endosymbiont clades:
df_cor <- data.frame(clean_endo[,c(-4,-5,-6,-7,-8,-9)])
df_cor <- reshape(df_cor, idvar = "OG", timevar = "origin", direction = "wide")
row.names(df_cor) <- df_cor$OG
df_cor <- df_cor[,c(-1)]

new_order <-  sort(colnames(df_cor),decreasing=F)
df_cor <- df_cor[, new_order]

colnames(df_cor) <- c("free-living","A","B","C","D","E1","E2","F","G","H","I","L")
my.palette <- get_palette(c("white", "white", "orange"), 10)
cor.test <- df_cor %>% cor_test(method = "spearman", use='pairwise.complete.obs')
cor.test$p <- p.adjust(cor.test$p, method = "hochberg", n = length(cor.test$p))
cor.mat <- as_cor_mat(cor.test)
p.mat <- cor_pmat(df_cor)

Fig4a <- ggcorrplot(cor.mat, p.mat = p.mat, sig.level = 0.001, hc.order = F, type = "lower", outline.col = "white", ggtheme = ggplot2::theme_bw,
           colors = c("white", "white", "orange"),method = "circle") + 
  theme(axis.text.y = element_text(size = 14)) + theme(axis.text.x = element_text(size = 14))

ggsave("Fig4a.pdf", Fig4a, width = 7, height = 7)

Fig4a



Fig. S8b - families mean loss frequency correlation with free-living selection regime (across different endosymbiont lineages + free-living):
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)

clean <- merge(df,dnds, by = "OG")

clean <- subset(clean, mean_freq != "NA")
clean$mean_freq <- as.numeric(clean$mean_freq)

clean$origin[clean$origin == 0] <- 'free-living'

clean <- clean[clean$OG %in% endo,]  

FigS8b <- ggplot(clean, aes(x=free, y=mean_freq)) + 
  geom_point(size = 4, alpha =.05, pch = 20, color = "orange") + 
  geom_smooth(method = "lm", se = F, color="black", fullrange = TRUE, linetype = "dashed") +
  scale_linetype_manual(values ="dashed") +
  facet_wrap( ~ origin) +
  stat_cor(method = "spearman", label.y = 0.75, size = 2) +
  theme_bw() +
  xlab("dNdS in free-living species") + 
  ylab("gene loss rate") +
  theme(plot.title = element_text(size = 10)) + 
  theme(aspect.ratio = 1) +
  facet_wrap( ~ origin) +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(strip.text = element_text(face = "bold", size = rel(0.45)), strip.background = element_rect(fill = "white", colour = "black", size = 1)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + theme(plot.title = element_text(size = 10)) + xlim(0,1) + ylim(0,1)



Fig. S8:
FigS8 <- plot_grid(FigS8a,FigS8b,
                  axis = "l", 
                  align = "hv",
                  labels = c("A","B"))

ggsave("FigS8.pdf", FigS8, width = 12, height = 6)

FigS8



DNA repair genes
clean <- merge(df,dnds, by="OG")
repair_genes <- read.table("DNA_repair_annotations.tsv", sep="\t",header=T)
tmp <- clean[clean$OG %in% repair_genes$OG, ]
tmp <- reshape(tmp, idvar = "OG", timevar = "origin", direction = "wide")
tmp <- tmp[,c(1:5)]
tmp <- merge(tmp, repair_genes, by="OG")
colnames(tmp) <- c("OG", "loss_F","dnds_F","dnds_E","loss_E", "gene")
repair_genes <- tmp[, c(1, 6, 2, 5, 3, 4)]

repair_genes

repair_genes$loss_F <- as.numeric(repair_genes$loss_F)
repair_genes$loss_E <- as.numeric(repair_genes$loss_E)
repair_genes$dnds_F <- as.numeric(repair_genes$dnds_F)
repair_genes$dnds_E <- as.numeric(repair_genes$dnds_E)
repair_genes <- na.omit(repair_genes)
  
# is the omega of specific DNA repair genes statistically higher than genes in the category "Replication, recombination and repair"??!
ks.test(
  repair_genes$dnds_E,
  cog_clean$endo[cog_clean$description == "Replication, recombination and repair"],
  alternative = "greater")

# is the omega of genes in "Replication, recombination and repair" statistically higher than all other genes??!
ks.test(
  cog_clean$endo[cog_clean$description == "Replication, recombination and repair"],
  cog_clean$endo[cog_clean$description != "Replication, recombination and repair"],
  alternative = "greater")

# is the loss frquency of specific DNA repair genes statistically higher than genes in the category "Replication, recombination and repair"??!
ks.test(
  repair_genes$loss_E,
  cog_clean$mean_freq[cog_clean$origin == "endosymbiont" & cog_clean$description == "Replication, recombination and repair"], 
  alternative = "greater")

# is the loss frquency of genes in "Replication, recombination and repair" statistically higher than all other genes??!
ks.test(
  cog_clean$mean_freq[cog_clean$origin == "endosymbiont" & cog_clean$description == "Replication, recombination and repair"],
  cog_clean$mean_freq[cog_clean$origin == "endosymbiont" & cog_clean$description != "Replication, recombination and repair"],
  alternative = "greater")



linearize the tree:
tree <- read.tree("DLT_analyses_SpeciesRax_all_genes_SR4/DLT_tree_conv.newick")
tree <- drop.tip(tree, "NZ-LG025083.1")
tree <- drop.tip(tree,c("NZ-CP028926.1","NZ-CP009610.1"))           # drop outgroups
tree <- reroot(tree,which((tree$tip.label == "NZ-CP050969.1") == "TRUE"))     # reroot on Plesiomonas shigelloides
tree$tip.label <- gsub("\\.[0-9]", "",tree$tip.label )

info <- merge(annot, sp_traits, by = "row.names")
row.names(info) <- info$Row.names

#info <- info[!info$species.y == "NZ-CP028926", ]
#info <- info[!info$species.y == "NZ-CP009610", ]

treec1 <- chronos(tree, model = "correlated", lambda = 1)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -45.7859 
## Optimising rates... dates... -45.7859 
## Optimising rates... dates... -45.78589 
## 
## log-Lik = -45.75865 
## PHIIC = 1336.63
lnLt1 <- attr(treec1, "ploglik")
treec2 <- chronos(tree, model = "correlated", lambda = 2)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.01406 
## Optimising rates... dates... -46.01406 
## Optimising rates... dates... -46.01406 
## 
## log-Lik = -46.10371 
## PHIIC = 1338.85
lnLt2 <- attr(treec2, "ploglik")
treec3 <- chronos(tree, model = "correlated", lambda = 3)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.24338 
## Optimising rates... dates... -46.24338 
## Optimising rates... dates... -46.24328 
## 
## log-Lik = -46.32981 
## PHIIC = 1339.56
lnLt3 <- attr(treec3, "ploglik")
treec4 <- chronos(tree, model = "correlated", lambda = 4)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.39016 
## Optimising rates... dates... -46.39016 
## Optimising rates... dates... -46.38921 
## 
## log-Lik = -46.50299 
## PHIIC = 1340.41
lnLt4 <- attr(treec4, "ploglik")
treec5 <- chronos(tree, model = "correlated", lambda = 5)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.47846 
## Optimising rates... dates... -46.47846 
## Optimising rates... dates... -46.47368 
## 
## log-Lik = -46.59152 
## PHIIC = 1341.06
lnLt5 <- attr(treec5, "ploglik")
treec6 <- chronos(tree, model = "correlated", lambda = 6)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.69738 
## Optimising rates... dates... -46.69738 
## Optimising rates... dates... -46.69172 
## 
## log-Lik = -46.81291 
## PHIIC = 1340.76
lnLt6 <- attr(treec6, "ploglik")
treec7 <- chronos(tree, model = "correlated", lambda = 7)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.64066 
## Optimising rates... dates... -46.64066 
## Optimising rates... dates... -46.6327 
## 
## log-Lik = -46.80559 
## PHIIC = 1341.94
lnLt7 <- attr(treec7, "ploglik")
treec8 <- chronos(tree, model = "correlated", lambda = 8)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.94669 
## Optimising rates... dates... -46.94669 
## Optimising rates... dates... -46.94262 
## 
## log-Lik = -47.0448 
## PHIIC = 1341.35
lnLt8 <- attr(treec8, "ploglik")
treec9 <- chronos(tree, model = "correlated", lambda = 9)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.74922 
## Optimising rates... dates... -46.74922 
## Optimising rates... dates... -46.74632 
## 
## log-Lik = -46.94889 
## PHIIC = 1342.23
lnLt9 <- attr(treec9, "ploglik")
treec10 <- chronos(tree, model = "correlated", lambda = 10)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -46.58149 
## Optimising rates... dates... -46.58149 
## Optimising rates... dates... -46.55624 
## 
## log-Lik = -46.73676 
## PHIIC = 1345.25
lnLt10 <- attr(treec10, "ploglik")

lnLt <- c(lnLt1,lnLt2,lnLt3,lnLt4,lnLt5,lnLt6,lnLt7,lnLt8,lnLt9,lnLt10)

plot(lnLt)

treec <- chronos(tree, model = "correlated", lambda = 1)
## 
## Setting initial dates...
## Fitting in progress... get a first set of estimates
##          (Penalised) log-lik = -45.7859 
## Optimising rates... dates... -45.7859 
## Optimising rates... dates... -45.78589 
## 
## log-Lik = -45.75865 
## PHIIC = 1336.63
info <- info[match(treec$tip.label, info$species.y),]

treec$tip.label <- info$sp



Fig. 1 - endosymbiosis ancestral state reconstruction:
state <- setNames(as.factor(info$endosymbiont),info$sp)

DOLLO <- matrix(c(0,1,0,0),2,2)

fitARD<-ace(state,treec, model=DOLLO, type="discrete")

cols<-setNames(c("orange","turquoise"),levels(state))

plot.phylo(treec,type="fan", cex=0.45, lwd=1, label.offset=0.025)

tiplabels(pie=to.matrix(state[treec$tip.label],levels(state)),piecol=cols,cex=0.1)

nodelabels(node=1:treec$Nnode+Ntip(treec),
           pie=fitARD$lik.anc,piecol=cols,cex=0.15)

add.simmap.legend(colors=cols,prompt=FALSE,x=0.9*par()$usr[1],
                  y=0.8*par()$usr[3],fsize=0.8)

pdf(file="Fig1.pdf", width=15, height=15)

tree <- beast_tree
origina_annot <- read.table("phylogenetic_resolutions/list_of_origins.tsv", sep="\t", header=T)

tmp <- data
tmp$species <- gsub("\\.[0-9]", "",tmp$node )
tmp  <- left_join(tmp, origina_annot, by = 'species')
origina_annot  <- tmp[,c(1,14,15,16,17,18,19,20,21,22,23,24)]

wow <- origina_annot
wow <- wow[,c(1,3)]
colnames(wow) <- c("id","species")

p <- ggtree(tree, layout='circular', branch.length='none')  %<+% wow + geom_tiplab(aes(label=species), offset = 7, size=3,fontface = "italic") +
  theme(legend.position = "none")

p <- p + xlim(NA, 31)

p <- p + scale_color_manual(values=c("black", "white")) + geom_tippoint(aes(color=state)) + scale_color_manual(values=c("orange","orange","#c0ecf5")) +
  geom_point2(aes(subset=(label=="s96")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s96")), label="A", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="AP012551.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="AP012551.1")), label="B", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCF-012026695.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCF-012026695.1")), label="C", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s156")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s156")), label="D", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s153")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s153")), label="E1", size=2, color='white', fontface = "bold") +
    geom_point2(aes(subset=(label=="NZ-CP006568.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="NZ-CP006568.1")), label="E2", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s142")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s142")), label="F", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s28")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s28")), label="G", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="GCA-001457715.1")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="GCA-001457715.1")), label="H", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s181")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s181")), label="I", size=2, color='white', fontface = "bold") +
  geom_point2(aes(subset=(label=="s192")), shape=21, size=5, fill='black') +
    geom_text2(aes(subset=(label=="s192")), label="L", size=2, color='white', fontface = "bold")
 
rownames(origina_annot) <- origina_annot$node
origina_annot <- origina_annot[,c(5,6,7,8,9,10,11,12)]

gheatmap(p, origina_annot, offset=0.5, width=0.25, hjust=0, colnames_style(color="white"),
         colnames=FALSE, colnames_offset_y = 8) + 
  theme(legend.position = 'none') +
  scale_fill_manual(values=c("white","#fd7f6f", "#7eb0d5", "#b2e061", "#bd7ebe", "#ffd200", "#e69138", "#beb9db", "#fdcce5", "#8bd3c7", "lightgray","darkgray"))

dev.off() 
## quartz_off_screen 
##                 2



TESTING AN ALTERNATE TOPOLOGY:

Here we perform a large subset of analyses on an alternate topology, as a sensitivity test on how the phylogenetic resolution can impact our findings.



data wrangling pt. 1b:
node_annotation = read.table("DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/node_annotation.tsv",sep="\t",header=T)   #Path to node annotation

rates = read.table("DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/rates_conv.txt",sep=" ",header=F)             #Path to per_species_rates
colnames(rates) <- c("node","duplication_rate","loss_rate","transfer_rate")
counts = read.table("DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/counts_conv.txt",sep=" ",header=F)          #Path to per_species_counts
colnames(counts) <- c("node","S","SL","D","T","TL")
counts$S <- gsub("S=", "", counts$S)
counts$SL <- gsub("SL=", "", counts$SL)
counts$D <- gsub("D=", "", counts$D)
counts$T <- gsub("T=", "", counts$T)
counts$TL <- gsub("TL=", "", counts$TL)

tmp  <- left_join(node_annotation, rates, by = 'node')
data  <- left_join(tmp, counts, by = 'node')



Fig. S9a + S9b + S9c - DTL rates differ between symbiont and free-living branches:

NB: endosymbiosis-enstablishment branches are not included here.

all_branches_events <- data %>% filter(! str_detect(state, "EST"))

all_branches_events$loss_rate <- as.numeric(all_branches_events$loss_rate)
all_branches_events$duplication_rate <- as.numeric(all_branches_events$duplication_rate)
all_branches_events$transfer_rate <- as.numeric(all_branches_events$transfer_rate)

all_branches_events$state[all_branches_events$state == "E"] <- "endosymbiont"
all_branches_events$state[all_branches_events$state == "N"] <- "free-living"

state <- ""
levels(state) <- c("endosymbiont","free-living")
cols<-setNames(c("orange","#c0ecf5"),levels(state))

shapiro.test(all_branches_events$loss_rate)
## 
##  Shapiro-Wilk normality test
## 
## data:  all_branches_events$loss_rate
## W = 0.90402, p-value = 2.361e-15
p <- ks.test(all_branches_events$loss_rate[all_branches_events$state == "endosymbiont"],all_branches_events$loss_rate[all_branches_events$state == "free-living"])$p.value
FigS9a <- ggplot(all_branches_events, aes(y=loss_rate, x=state, fill=state)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) + ggtitle(paste("loss rates p=",round(p,4))) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + 
  theme(axis.title.x = element_blank()) + labs(y = "gene loss rates") + theme(plot.title = element_text(size = 10)) +
  coord_cartesian(ylim=c(0, .5))

shapiro.test(all_branches_events$duplication_rate)
## 
##  Shapiro-Wilk normality test
## 
## data:  all_branches_events$duplication_rate
## W = 0.26407, p-value < 2.2e-16
p <- ks.test(all_branches_events$duplication_rate[all_branches_events$state == "endosymbiont"],all_branches_events$duplication_rate[all_branches_events$state == "free-living"])$p.value
FigS9b <-  ggplot(all_branches_events, aes(y=duplication_rate, x=state, fill=state)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) + ggtitle(paste("duplication rates p=",round(p,4))) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + 
  theme(axis.title.x = element_blank()) + labs(y = "gene duplication rates") + theme(plot.title = element_text(size = 10)) +
  coord_cartesian(ylim=c(0, .05))

shapiro.test(all_branches_events$transfer_rate)
## 
##  Shapiro-Wilk normality test
## 
## data:  all_branches_events$transfer_rate
## W = 0.81517, p-value < 2.2e-16
p <- ks.test(all_branches_events$transfer_rate[all_branches_events$state == "endosymbiont"],all_branches_events$transfer_rate[all_branches_events$state == "free-living"])$p.value
FigS9c <- ggplot(all_branches_events, aes(y=transfer_rate, x=state, fill=state)) + 
  geom_boxplot(adjust=1.5, width=0.2, outlier.shape = NA) + 
  theme_bw() + scale_fill_manual(values = cols) + ggtitle(paste("transfer rates p= ",round(p,4))) +
  theme(legend.position = "none") + 
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) + 
  theme(axis.title.x = element_blank()) + labs(y = "gene transfer rates") + theme(plot.title = element_text(size = 10)) +
  coord_cartesian(ylim=c(0, .5))



Fig. S9d - contingent gene losses overall:
timeline <-  read.table("DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/lineages.tsv",sep="\t",header=T) 

timeline  <- left_join(timeline, data, by = 'node')

timeline <- timeline %>%
  group_by(coordinate, origin.x) %>% 
  summarise_at(vars("loss_rate"), mean)

tmp <- timeline %>%
  mutate(name2=origin.x)

FigS9d <- ggplot(data=tmp, aes(x=coordinate, y=loss_rate, colour=origin)) + 
  xlab("") + theme_bw() + theme(text = element_text(size = 10)) + 
  geom_smooth(xintercept=-1, linetype="dashed", colour = "orange", size=1, alpha=0.2, colour=timeline$origin) +
  theme(
      legend.position="none",
      plot.title = element_text(size=14),
      panel.grid = element_blank()
    ) + theme_bw() + theme(axis.title.x=element_blank()) + geom_vline(xintercept = 0, linetype="dotted",size=.5) +
    theme(strip.background =element_rect(fill="white")) + 
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),panel.background = element_blank()) +
  theme(aspect.ratio=1) +
  labs(y = "gene loss rate") + labs(x = "branch") + theme(plot.title = element_text(size = 10))



files <- list.files(pattern=".txt", path="./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/PerFamily_counts/",full.names = T)

tokeep <- read.table("./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/node_annotation.tsv",sep="\t",header=T)  

ancs <- vector()  # genes present in Enterobacterales ancestor 
endo <- vector()  # genes present in at least an endosymbiont clade MRCA
aftr <- vector()  # genes originated or acquired in Enterobacterales after any shift to endosymbiosis  
nevr <- vector()  # genes which where originated or acquired in free-living only

ubiq <- vector()  # genes present in all Enterobacterales spp.
up50 <- vector()  # genes present in more than 50% of Endosmybionts spp.
dn50 <- vector()  # genes present in less than 50% of Endosmybionts spp.
lost <- vector()  # genes absent in Endosmybionts spp.

j=1

  for (i in files) {
    df2 <- read.table(i, sep = " ", header = T)
    df2[] <- lapply(df2, gsub, pattern=',', replacement='')
    OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_ACGT_speciesEventCounts.txt","",i)
    OG <- gsub("./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/PerFamily_counts//","",OG)
    names(df2)=c("node","S","D","L","T","P","O")
    df2 <- merge(df2,tokeep, by="node")
    if (sum(length(df2[df2$node == "s3",]$P)) > 0) {
      ancs[[j]] <- OG
    }
    if (sum(as.numeric(df2[df2$state == "EST",]$L)) > 0) {
      endo[[j]] <- OG
      df2 <- df2[with(df2, ! grepl("^s", node)),]
      df2 <- subset(df2, state == "E" |  state == "EST")
      df2 <- subset(df2, P == 1)
      if (length(df2$node) == 75) {
        ubiq[[j]] <- OG
        } else if (length(df2$node) > 36) {
          up50[[j]] <- OG
          } else if (length(df2$node) > 0) {
            dn50[[j]] <- OG
            } else if (length(df2$node) == 0) {
              lost[[j]] <- OG
           }
      } else {
        df2 <- df2[with(df2, ! grepl("^s", node)),]
        if (sum(length(df2[df2$state == "E" | df2$state == "EST",]$P)) > 0) {
          aftr[[j]] <- OG
        } else {
          nevr[[j]] <- OG
    }
      }
    j = j + 1
  }

length(files)         # Among all genes
## [1] 10152
length(na.omit(endo)) # genes present in at least one free-livinig MRCA of an endosymbiotic clade
## [1] 4476
length(na.omit(aftr)) # genes originated or acquired in endosmybionts after any shift to endosymbiosis  
## [1] 3081
length(na.omit(nevr)) # genes which where originated or acquired in free-living only
## [1] 2595
# Among genes which were present in at least one free-living MRCA of an endosymbiotic clade
length(na.omit(ubiq)) # genes present in all endosmybiont spp.
## [1] 6
length(na.omit(up50)) # genes present in more than 50% of endosmybionts spp.
## [1] 668
length(na.omit(dn50)) # genes present in less than 50% of endosmybionts spp.
## [1] 2946
length(na.omit(lost)) # genes lost in all endosmybionts spp.
## [1] 856
data wrangling pt. 3b:
files <- list.files(pattern=".txt", path="./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/PerFamily_counts/",full.names = T)

tokeep <- read.table("./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/node_annotation.tsv",sep="\t",header=T)  

names(tokeep)=c("Node","Trait","Origin")

tokeep$Origin[tokeep$Trait != "N"] <- "endosymbiont"
tokeep$Origin[tokeep$Trait == "N"] <- "free-living"

for (k in c("free-living","endosymbiont")){

  df=data.frame(OG=character(),mean=numeric())
  ortogruppo = vector()
  media_OG = vector()
  j = 1
  for (i in files) {
    df2 <- read.table(i, sep = " ", header = T)
    df2[] <- lapply(df2, gsub, pattern=',', replacement='')
    df2$speciations. <- as.numeric(df2$speciations.)
    df2$duplications. <- as.numeric(df2$duplications.)
    df2$losses. <- as.numeric(df2$losses.)
    df2$transfers. <- as.numeric(df2$transfers.)
    OG <- gsub("./PerFamily_counts//","",i)
    OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_speciesEventCounts.txt","",OG)
    names(df2)=c("Node","S","D","L","T","P","O")
    df2 <- merge(df2,tokeep, by="Node")
    df2 <- subset(df2, Origin == k)
    if (length(df2$Trait) >= 1) {
      df2$freq <- (df2$L / (df2$S + df2$L))
      df2 <- na.omit(df2)
      mean_OG=mean(df2$freq)
      #print(mean_OG)
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- mean_OG
    } else {
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- "1"
    }
      j = j + 1
  }

  df=data.frame(OG=ortogruppo,mean_freq=media_OG)
  df$OG <- gsub("./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR","",df$OG)
  df$origin <- k
  assign(paste("df", k, sep="_"), df)
}

df <- rbind(`df_free-living`,`df_endosymbiont`)
df$OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_ACGT_speciesEventCounts.txt","",df$OG)
df$mean_freq <- as.numeric(df$mean_freq)

#

dnds <- read.table("dnds_total.tsv", sep="\t", header = T)
clean <- merge(df,dnds, by="OG")
clean<-subset(clean, free!="NA")
clean<-subset(clean, mean_freq!="NA")
clean<-subset(clean, mean_freq!="NaN")
clean<-subset(clean, free < 1)
clean<-subset(clean, endo < 1)
cog <- read.table("cog.tsv", sep="\t", header = T)
clean <- left_join(clean, cog, by = "OG")
clean <- clean %>% separate_rows(cog, sep = '(?<=.)(?=.)')
cog_conversion <- read.table("cog_conversion.tsv", sep="\t", header = T)
clean <- left_join(clean, cog_conversion, by = "cog")



Fig. S9f:
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)

clean <- merge(df,dnds, by = "OG")

colnames(clean) <- c("OG","mean_loss_freq","origin","dnds_free","dnds_endo")
clean <- spread(clean, key = origin, value = mean_loss_freq)
colnames(clean) <- c("OG","dnds_free","dnds_endo","mean_loss_freq_endo","mean_loss_freq_free")

clean <- clean[c("OG","dnds_free","mean_loss_freq_endo")]

clean <- subset(clean, mean_loss_freq_endo != "NA")
clean <- subset(clean, mean_loss_freq_endo != "NaN")
clean <- subset(clean, dnds_free != "NA")
clean <- subset(clean, dnds_free != "NaN")
clean <- subset(clean, dnds_free < 1)

clean <- clean[clean$OG %in% endo,]  

dim(clean) # number of OGs considered in this analysis
## [1] 4368    3
FigS9f <- ggplot(clean, aes(x=dnds_free, y=mean_loss_freq_endo)) + 
  geom_point(size = 4, alpha =.05, pch = 20, color = "orange") + 
  geom_smooth(method = "lm", se = F, color="black", fullrange = TRUE, linetype = "dashed") +
  scale_linetype_manual(values ="dashed") +
  theme_bw() +
  xlab("dNdS in free-living") + 
  ylab("mean loss frequency in endosymbionts") +
  theme(plot.title = element_text(size = 10)) + 
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),panel.background = element_blank()) + 
  theme(strip.text = element_text(face = "bold", size = rel(0.45)), strip.background = element_rect(fill = "white", colour = "black", size = 1)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + theme(plot.title = element_text(size = 10)) + xlim(0,1) + ylim(0,1) +
  theme(legend.position = "none") + xlim(0,.75) +
  theme(axis.text.y = element_text(size = 14)) + theme(axis.text.x = element_text(size = 14)) + theme(aspect.ratio = 1)

FigS9f <- FigS9f + stat_cor(method = "spearman", cex = 5)



data wrangling pt. 4b:
files <- list.files(pattern=".txt", path="./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/PerFamily_counts/",full.names = T)

tokeep <- read.table("./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR4/node_annotation.tsv",sep="\t",header=T)  

names(tokeep)=c("Node","Trait","Origin")

for (k in c("A","B","C","D","E","F1",'F2',"G","H","I","L","M","0")){

  df=data.frame(OG=character(),mean=numeric())
  ortogruppo = vector()
  media_OG = vector()
  j = 1
  for (i in files) {
    df2 <- read.table(i, sep = " ", header = T)
    df2[] <- lapply(df2, gsub, pattern=',', replacement='')
    df2$speciations. <- as.numeric(df2$speciations.)
    df2$duplications. <- as.numeric(df2$duplications.)
    df2$losses. <- as.numeric(df2$losses.)
    df2$transfers. <- as.numeric(df2$transfers.)
    OG <- gsub("./PerFamily_counts//","",i)
    OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_speciesEventCounts.txt","",OG)
    names(df2)=c("Node","S","D","L","T","P","O")
    df2 <- merge(df2,tokeep, by="Node")
    df2 <- subset(df2, Origin == k)
    if (length(df2$Trait) >= 1) {
      df2$freq <- (df2$L / (df2$S + df2$L))
      df2 <- na.omit(df2)
      mean_OG=mean(df2$freq)
      #print(mean_OG)
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- mean_OG
    } else {
      ortogruppo[[j]] <- OG
      media_OG[[j]] <- "1"
    }
    j = j + 1
  }

  df=data.frame(OG=ortogruppo,mean_freq=media_OG)
  df$OG <- gsub("./DLT_analyses_concatenation_all_genes_site_heterogeneous_SR","",df$OG)
  df$origin <- k
  assign(paste("df", k, sep="_"), df)
}

df <- rbind(df_A,df_B,df_C,df_D,df_E,df_F1,df_F2,df_G,df_H,df_I,df_L,df_M,df_0)
df$OG <- gsub("_ol_fa_mafft_gappyout_NoSp_Renamed_ACGT_speciesEventCounts.txt","",df$OG)



Fig. S9g:
dnds <- read.table("dnds_total.tsv", sep="\t", header = T)
clean <- merge(df,dnds, by="OG")
clean <- subset(clean, dnds != "NA")
clean <- subset(clean, mean_freq != "NA")
clean <- subset(clean, mean_freq != "NaN")
clean <- subset(clean, free != "NaN")
clean <- subset(clean, endo != "NaN")

cog <- read.table("cog.tsv", sep="\t", header = T)
clean_endo <- left_join(clean, cog, by = "OG")

clean_endo <- subset(clean_endo, cog != "NA")
clean_endo <- subset(clean_endo, cog != "-")

clean_endo <- clean_endo %>%separate_rows(cog, sep = '(?<=.)(?=.)')

clean_endo <- subset(clean_endo, cog != "A")
clean_endo <- subset(clean_endo, cog != "B")
clean_endo <- subset(clean_endo, cog != "W")
clean_endo <- subset(clean_endo, cog != "Z")

clean_endo <- left_join(clean_endo, cog_conversion, by = "cog")

clean_endo$mean_freq <- as.numeric(clean_endo$mean_freq)

clean_endo <- clean_endo[clean_endo$OG %in% endo,]  

A <- subset(clean_endo, origin == "A" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
A$rank <- rownames(A)
A$origin <- "A"

B <- subset(clean_endo, origin == "B" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
B$rank <- rownames(B)
B$origin <- "B"

C <- subset(clean_endo, origin == "C" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
C$rank <- rownames(C)
C$origin <- "C"

D <- subset(clean_endo, origin == "D" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
D$rank <- rownames(D)
D$origin <- "D"

E1 <- subset(clean_endo, origin == "E1" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
E1$rank <- rownames(E1)
E1$origin <- "E1"

E2 <- subset(clean_endo, origin == "E2" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
E2$rank <- rownames(E2)
E2$origin <- "E2"

effe <- subset(clean_endo, origin == "F" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
effe$rank <- rownames(effe)
effe$origin <- "effe"

G <- subset(clean_endo, origin == "G" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
G$rank <- rownames(G)
G$origin <- "G"

H <- subset(clean_endo, origin == "H" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
H$rank <- rownames(H)
H$origin <- "H"

I <- subset(clean_endo, origin == "I" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
I$rank <- rownames(I)
I$origin <- "I"

L <- subset(clean_endo, origin == "L" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
L$rank <- rownames(L)
L$origin <- "L"

zero <- subset(clean_endo, origin == "0" ) %>%
  group_by(description) %>%
  summarise_at(vars(mean_freq), list(freq = mean)) %>% 
  arrange(freq)
zero$rank <- rownames(zero)
zero$origin <- "zero"

loss_freq <- dplyr::bind_rows(A,B,C,D,E1,E2,effe,G,H,I,L,zero)
loss_freq$freq <- as.numeric(loss_freq$freq)
loss_freq$rank <- as.numeric(loss_freq$rank)

loss_freq$origin[loss_freq$origin == "effe"] <- "F"
loss_freq$origin[loss_freq$origin == "zero"] <- "free"

loss_freq <- loss_freq %>%  mutate(mycolor = ifelse(origin == "free", "#c0ecf5", "orange"))

order <- aggregate(rank~description,loss_freq,sum)
order <- order[order(order$rank),]$description

FigS9g <- ggplot(loss_freq, aes(x = factor(description, order), y=(as.numeric(rank)), label=origin, fill=mycolor)) + 
  geom_point() + 
  geom_label_repel(max.overlaps = Inf, min.segment.length = 0) + scale_fill_identity(c("#c0ecf5", "orange")) +
  theme_bw() + 
  #coord_flip() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  xlab("") + ylab("") + theme(legend.position = "none") +
  theme(axis.text.y = element_text(size = 12)) + theme(axis.text.x = element_text(size = 12)) +
  theme(panel.grid.minor = element_blank(),panel.background = element_blank()) +
  labs(y = "mean loss frequency ranks")



Fig. S9e:
df_cor <- data.frame(clean_endo[,c(-4,-5,-6,-7,-8,-9)])
df_cor <- reshape(df_cor, idvar = "OG", timevar = "origin", direction = "wide")
row.names(df_cor) <- df_cor$OG
df_cor <- df_cor[,c(-1)]

new_order <-  sort(colnames(df_cor),decreasing=F)
df_cor <- df_cor[, new_order]

colnames(df_cor) <- c("free-living","A","B","C","D","E","F1","F2","G","H","I","L","M")
my.palette <- get_palette(c("white", "white", "orange"), 10)
cor.test <- df_cor %>% cor_test(method = "spearman", use='pairwise.complete.obs')
cor.test$p <- p.adjust(cor.test$p, method = "hochberg", n = length(cor.test$p))
cor.mat <- as_cor_mat(cor.test)
p.mat <- cor_pmat(df_cor)

FigS9e <- ggcorrplot(cor.mat, p.mat = p.mat, sig.level = 0.001, hc.order = F, type = "lower", outline.col = "white", ggtheme = ggplot2::theme_bw,
           colors = c("white", "white", "orange"),method = "circle") + 
  theme(axis.text.y = element_text(size = 14)) + theme(axis.text.x = element_text(size = 14)) + theme(legend.position = "none") 



wrapping up Fig. S9 - topology sensitivity test:
FigS9tmp <- (FigS9a | FigS9b | FigS9c) /
            (FigS9d | FigS9e | FigS9f) /
                      FigS9g

FigS9 <- FigS9tmp + plot_annotation(tag_levels = 'A')

ggsave("FigS9.pdf", FigS9, width = 12, height = 16)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
FigS9
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'



wrapping up - save workspace:
save.image(file="workspace.RData")